Direct Extension of Density-Matrix Renormalization Group toward 2-Dimensional 
Quantum Lattice Systems: Studies for Parallel Algorithm, Accuracy, and Performance 
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We parallelize density-matrix renormalization group to directly extend it to 2-dimensional (n- 
leg) quantum lattice models. The parallelization is made mainly on the exact diagonalization for 
the superblock Hamiltonian since the part requires an enormous memory space as the leg number n 
increases. The superblock Hamiltonian is divided into three parts, and the correspondent superblock 
vector is transformed into a matrix, whose elements are uniformly distributed into processors. The 
parallel efficiency shows a high rate as the number of the states kept m increases, and the eigenvalue 
converges within only a few sweeps in contrast to the multichain algorithm. 

PACS numbers: 71.10.Fd, 71.10.Pm, 74.20.Mn, 03.75.Ss 
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The superfluidity achieved in atomic Fermi gas [l| is 
quite useful in studying strongly-coupled superfluidity. 
Very recently, such a success has intensively pushed ex- 
perimentalists to find another type of superfluidity, which 
emerges on strongly-correlated 2-dimensional (2-D) lat- 
tice system. This is because the so-called "optical lattice" 
build in atomic gases may offer a testbed to directly solve 
the Hubbard model and related controversial issues in 
High-Tc cuprate superconductors in a controllable man- 
ner [1]. 

So far, several computational approaches have been 
proposed in order to study strongly-correlated lat- 
tice fermions. Among them, three methods, i.e., the 
exact-diagonalization, the density-matrix renormaliza- 
tion group (DMRG) [1, 1^, and the quantum Monte 
Carlo are widely employed as standard and established 
ones. However, 2-D systems are too complicated for these 
methods to uncover, and the ground-states in the most 
2-D models are still open problems. In this paper, we 
therefore suggest a parallel algorithm to directly extend 
DMRG to 2-D models. 

The DMRG method, which was originally aimed for 
1-D lattice model, can be extended to 2-D models (n- 
leg models) as depicted in Fig.[l](a). Then, the number 
of the states required in the direct algorithm is roughly 
given as 16*m^ [A^m?) for s-leg Hubbard model (the s-leg 
Heisenberg model) per block, in which m is the number of 
states kept. Although the degree of freedom practically 
decreases with eliminating irrelevant states, it is clear 
that a slight increment of the leg gives rise to an expo- 
nential like growth of the state number. Thus, the direct 
extension has been limited within 2-leg Q, and the pre- 
vious 2-D DMRG has adopted the so-called multichain 
algorithm as depicted in Fig. [T] (b) since its memory space 
is basically comparable to the 1-D case [1]. However, the 
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FIG. 1: (a) A superblock configuration to directly perform 
DMRG for 2-D lattice models, (b) The conventional con- 
figuration using "multichain" algorithm to avoid enormous 
memory expansion. 



multichain algorithm has difficulties in its convergence 
property and accuracy [1, Q . 

The direct 2-D extension of DMRG guarantees high 
accuracy similar to 1-D cases, although it requires an 
enormous memory space. Thus, if possible, it is valuable 
to parallelize the direct 2-D DMRG and obtain a scal- 
able code, which enables to raise the number of legs with 
increasing computational resources. The present-day big 
supercomputers have a tera-byte order of memory. We 
therefore claim that a scalable algorithm may be crucial 
in advancing computational research on 2-D models. We 
examine the parallel efficiency as well as the convergence 
property of 2-D direct DMRG on a parallel supercom- 
puter Altix 3700Bx2 in JAEA. 

Let us explain the parallel algorithm. In the direct 
2-D DMRG as shown in Fig. [1] (a) , a routine which con- 
sumes most of the computer resources, i.e., memory and 
CPU time, is the exact diagonalization of the superblock 
Hamiltonian H . Inside the routine, a major operation is 
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FIG. 2: A superblock of 2-D direct DMRG, which is com- 
posed of four blocks named "block 1", "block 2", "block 3", 
and "block 4" from the left. See the text for the index ij 
(top panel). Hi, Hr, and He (bottom panel) denote the de- 
composed block Hamiltonians in the left, the right, and the 
central block, respectively. 



the multiplication between the Hamiltonian matrix and 
the vector, i.e., Hv. This is the most basic operation re- 
peated over and again in the exact diagonalization and 
DMRG. In general, the parallelization of the multiplica- 
tion between the sparse Hamiltonian matrix ^] and the 
vector can be simply realized by distributing the sparse 
matrix rowwisely. However, it is difficult to obtain a good 
load balance for the cases like the interacting 2-D lattice 
fermions, whose non-zero element distribution is not so 
regular. In this paper, we, therefore, propose an alter- 
native parallel strategy, which transforms the superblock 
vector into a matrix form, and distribute the matrix into 
processors. 

Let us write down the algorithm. Each block of the 
superblock is called "block 1" , "block 2" , "block 3" , and 
"block 4" from the left, and the state of the "block j" is 
represented as ij (see the top panel of Fig. [21). Then, the 
Hamiltonian matrix H, 



(= H) is given by 



ili2i3*4;j'ij2'3*4 

~^-^i2i3;i'2i'3^ili4;i'ii'4^ (-'-) 

where Si;j is the Kronecker's delta, and Hi-^i^-i'^i/^, 
Hi^i^-i'^i'^, and iJ^^iaii^i!, are the block Hamiltonian ma- 
trices in the left block, the right block, and the central 
block, respectively. In the following, we express them 
as Hi, Hr, and He (see the bottom panel of Fig. ^ 
for simplicity. Here, we put the {{ij, — \)m?n -f [i^ — 
\)mn + (i2 — 1)"^ + zi)-th element of the vector v, 
which corresponds to the state |jii2«3*4), into an element 
((*2 — l)w -f ii, (is — l)m -f ii) of a matrix V . Then, the 



as the following matrix-matrix multiplications 

where H^ denotes the transpose of H . Similarly, when 
the same element is put into the element ((zs — l)n -t- 
*2, (z4 — l)?Ti-f ii) of another matrix Vc, the multiplication 
HrV is rewritten into 



HrV,. 



While the matrix Hi, Hr, and H, are sparse matrices, 
the matrices V and Vc are complete dense ones since 
these are formed by elements of the superblock vector v. 
This indicates that the parallel calculation for the matrix- 
vector multiplication Hv can be effectively executed by 
partitioning the matrices V and Vc- This parallelization 
scheme has been successfully employed in the exact di- 
agonalization. The details of the parallel scheme and ef- 
ficiency were reported in [^Q. By using this algorithm, 
one can extend DMRG to arbitrary n-leg model as long 
as computation resources are unlimited. In addition, the 
direct method has several advantages. The application of 
the periodic boundary condition is not a problem at all, 
and the extension to the time-dependent, the dynamical, 
and the finite temperature DMRG [J is straightforward. 
However, it should be noted that the increment of the lad- 
der leg enlarges not only the dimension of the Hamilto- 
nian matrix but also that of the density matrix. Although 
the size of the density matrix becomes not so large, its 
diagonalization needs m eigenstates and its CPU time 
cost becomes non-negligible with the leg increment. The 
parallelization in terms of the density matrix is another 
difficult issue, since the size of the block diagonal matri- 
ces inside the density matrix can not be predicted prior 
to the execution. The parallelization should be adaptive 
to the dynamical change of the size. Its algorithm and 
technique will be published elsewhere fl^. In this pa- 
per, we restrict ourselves within the parallelization for 
the Hamiltonian matrix operation, since it is the most 
primary issue for 2-D extension. The maximum leg sizes 
in this paper are 9 (its results are not shown) and 5 for 
Heisenberg and Hubbard model, respectively, due to the 



limitation of CPU resource [11 1 



Let us present calculation results of the direct DMRG. 
Figure [3] (a) and (b) show how the ground state energy 
converges with repeating the sweep for 5 (leg) x 10-site 



J xy 



1) and the 3 x 10-site 



multiplication HiS. 



V and HrSi-^i^-i' i' v are rewritten 



Heisenberg model {Jz 
Hubbard model {U/t = 10) with 28 fermions (14 t, 14 j). 
The open boundary condition is applied to both models. 
These results demonstrate that both models converge 
to their ground state within once or twice sweeps [12l |. 
Moreover, the ground state energy sufficiently converges 
with m ~ 128 and ~ 256 for Heisenberg and Hubbard 
model, respectively. These features clearly prove that 
the direct DMRG method is quite excellent in the accu- 
racy and the convergence properties in contrast to the 
multichain algorithm [6]. 
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FIG. 3: 

The ground state energy vs. the sweep counts for (a) 
5(leg) X 10-site Heisenberg model and (b) 3 x 10-site 
Hubbard model, m is the number of states kept. 



Next, we present the performance of the direct DMRG 
method. A test example is the two-dimensional 7 x 10- 
site Heisenberg model. Fig. |4] (a) shows how CPU time 
decreases with increasing CPU number, and how the par- 
allel scalability depends on m. The latter effect is com- 
prehensible through a comparison among m ~i2, 64, and 
128. It is clear that the parallelization effect is improved 
when the number of states kept m increases. This is 
because the size of the Hamiltonian matrix to be diag- 
onalized grows with rn, and the parallelized operation 
counts increases. This behavior is common for 4 x 10- 
site Hubbard model with 38 fermions (19 t, 19 |) as 
shown in Fig. H (b). Thus, the direct 2D DMRG is a 
suitable application for parallel computer, since the true 
ground-state exploration requires sufficiently large m. 

Let us analyze details of the performance of the paral- 
lel direct DMRG to discuss the feasibility of the present 
algorithm for larger m and leg systems. Figure [5] (a) 
shows how CPU time of three main routines enlarges 
with increasing m. The three routines are the den- 
sity matrix formation (DMF), the density matrix diag- 
onalization (DMD), and the Hamiltonian matrix exact- 
diagonalization (HMED) . The target model is 3-leg Hub- 
bard model and 128 CPU's are used in all cases for a com- 
parison. One notices that the cost of HMED especially 
grows with increasing m. If HMED is not parallelized, 
then CPU cost for HMED is found to be too huge. The 
parallelization for HMED is clearly crucial. Fig. O (b) is 
a leg number dependence of CPU time balance for the 
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FIG. 4: The CPU number dependence of the total CPU time 
for (a) 7(leg) x 10-site Heisenberg model on Altix3700Bx2 with 
m = 32, 64, and 128. (b) The same dependence for 4 x 10 
Hubbard model. 
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FIG. 5: (a) The m dependence of CPU cost distribution for 
three main routines (see the text) of the present DMRG code. 
The model is 3 (leg) x 10-site Hubbard model. CPU time of 
each routine is averaged per step of DMRG. For every case, 
128 CPU's on Altix3700Bx2 are used, (b) The leg number 
dependence of CPU cost distribution for the Hubbard model. 
In all cases, m = 64, and 128 CPU's are used. 
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FIG. 6: (a) The hole density profile for 4(leg) x 20-site Hub- 
bard model {U/t = 10) with 76 fermions (38 T, 38 [ )and 
m = 160. (b) The ladder direction dependence of the max- 
imum value of the local staggered spin density along the leg 
direction for the same model with m = 64, 96, 128, and 160. 



three routines with m = 64. Although HMED sustains 
the position as the heaviest routine on the increment of 
the leg number, it is noted that the sum of DMF and 
DMD is comparable to HMED for 5-leg model. These 
results indicate that CPU costs in terms of the density 
matrix also becomes a bottleneck for larger leg cases. 

Finally, let us examine the validity of the ground state 
obtained by the present direct DMRG. We pay attention 
to the Hubbard model with just below the half-filling. 
The reason is that there is a controversial issue whether 
the stripe is the ground state or not [31 . We calculate 
spatial profiles of the hole density 



and the staggered spin density 

s{x,y) = {nx,y^ - flx.yd) 7 



(2) 



(3) 



with an open boundary condition for each direction. 
Here, n^^y^a (c =Ti i) is the density operator and (• • • ) 



denotes the ground state expectation value. One expects 
s{x, y) = for any local sites in the ground state of the fi- 
nite ladder Hubbard model, even if the hole density mod- 
ulation survives. This is a consequence of Lieb-Mattis 
theorem [13] . We calculate the ground state profiles of 
the 4 X 20-site Hubbard model with 76 spins (38T, 38^) 
at U/t = 10 with varying from m = 64 to m = 160. 
Figure [5] (a) shows the hole density profile for m = 160, 
while Fig. 6(b) presents m dependence of the ladder di- 
rection profiles of the maximum value of the staggered 
spin density along the leg-direction given as 



S{x) — max |s(x, y)\ 



(4) 



We point out that the present DMRG method rapidly 
converges non-polarized pattern for the staggered spin 
density profile with increasing m, although the hole den- 
sity one shows a stripe structure. These results are differ- 
ent from those of the multichain algorithm in which 
an extrapolation is required to remove the artificial pro- 
file of the spin density. To our knowledge, such a direct 
convergence is the first result in DMRG calculation of 
the 2-D Hubbard model. 

We developed a 2-D directly-extended code of DMRG. 
We parallelized the exact diagonalization part by trans- 
forming the superblock vector into the matrix and dis- 
tributing the elements. This parallel scheme becomes 
more effective as the number of states kept m increases. 
In addition, we confirmed in the repulsive 2-D Hubbard 
model that the stripe observable in the hole density pro- 
file is not artificial because the spin-density modulation 
as its counterpart disappears with increasing m accord- 
ing to Lieb-Mattis theorem. We believe that the present 
direct 2-D DMRG will give a great impact on the ground 
state exploration in atomic gas, solid state, and other 
systems, by the future use of advanced parallel comput- 
ers. 
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